Abstract

We develop two approaches for analyzing the approximation error bound for the Nystrom
method, one based on the concentration inequality of integral operators, and one based on
compressive sensing theory. We show that the approximation error, measured in the
spectral norm, can be improved from $O(N/\sqrt{m})$ to $O(N/m^{1-p})$ in the case of a large
eigengap, where $N$ is the total number of data points, $m$ is the number of sampled data
points, and $p \in (0, 1/2)$ is a positive constant that characterizes the eigengap. When the
eigenvalues of the kernel matrix follow a $p$-power law, our analysis based on compressive
sensing theory further improves the bound to $O(N/m^{p-1})$ under an incoherence assumption,
which explains why the Nystrom method works well for kernel matrices with skewed eigenvalues.
We present a kernel classification approach based on the Nystrom method and derive its
generalization performance using the improved bound. We show that when the eigenvalues
of the kernel matrix follow a $p$-power law, we can reduce the number of support vectors
to $N^{2p/(p^2-1)}$, a number less than $N$ when $p > 1 + \sqrt{2}$, without seriously
sacrificing its generalization performance.



1. Introduction 

The Nystrom method has been widely applied in machine learning to approximate large
kernel matrices to speed up kernel algorithms (Williams and Seeger, 2001; Drineas and
Mahoney, 2005; Fowlkes et al., 2004; Kumar et al., 2009; Silva and Tenenbaum, 2003;
Platt, 2004; Talwalkar et al., 2008; Zhang et al., 2008; Belabbas and Wolfe, 2009; Talwalkar
and Rostamizadeh, 2010; Cortes et al., 2010). In order to evaluate the quality of the
Nystrom method, we typically bound the norm of the difference between the original kernel
matrix and the low rank approximation created by the Nystrom method. Several analyses
have been developed to bound the approximation error of the Nystrom method (Drineas and
Mahoney, 2005; Kumar et al., 2009; Belabbas and Wolfe, 2009; Li et al., 2010; Talwalkar
and Rostamizadeh, 2010; Mackey et al., 2011; Gittens, 2011). Most of them focus on
additive error bounds, and base their analysis on the theoretical results from (Drineas and
Mahoney, 2005). When the target matrix is of low rank, significantly better bounds for the
approximation error of the Nystrom method were given in (Talwalkar and Rostamizadeh,
2010) and (Mackey et al., 2011). They are further generalized to kernel matrices of
arbitrary rank by a relative error bound in (Gittens, 2011). Although a relative error
bound is usually tighter than an additive bound (Mahoney, 2011), the relative error bound
in (Gittens, 2011) is proportional to $N$, where $N$ is the total number of data points, making
it unattractive for kernel matrices of very large size. In this study, we focus on the additive
error bound of the Nystrom method for general matrices, and will compare our results
mainly to the ones stated in (Drineas and Mahoney, 2005)^1. Below, we review the main
results in (Drineas and Mahoney, 2005) and their limitations.

Let $K \in \mathbb{R}^{N \times N}$ be the kernel matrix to be approximated, and $\lambda_i, i = 1, \ldots, N$ be the
eigenvalues of $K$ ranked in descending order. Let $\widetilde{K}(r)$ be an approximate kernel matrix
of rank $r$ generated by the Nystrom method. Let $m$ be the number of columns sampled from
$K$ used to construct $\widetilde{K}(r)$. Then, under the assumption $K_{i,i} = O(1)$, Drineas and Mahoney
(2005) showed that for any $m$ uniformly sampled columns^2, with a high probability,

$$\|K - \widetilde{K}(r)\|_2 \le \lambda_{r+1} + O\left(\frac{N}{\sqrt{m}}\right),$$

where $\|\cdot\|_2$ stands for the spectral norm of a matrix. By setting $r = m$, the above bound
becomes

$$\|K - \widetilde{K}(m)\|_2 \le \lambda_{m+1} + O\left(\frac{N}{\sqrt{m}}\right). \qquad (1)$$

The main problem with the bound in (1) is its slow reduction rate in the number
of sampled columns (i.e., $O(m^{-1/2})$), implying that a large number of samples is needed
in order to achieve a small approximation error. In this study, we aim to improve the
approximation error bound in (1) by considering two special cases of the kernel matrix $K$.
In the first case, we assume there is a large eigengap in the spectrum of $K$. More specifically,
we assume there exists a rank $r \in [N]$ such that $\lambda_r = \Omega(N/m^{p})$ and $\lambda_{r+1} = O(N/m^{1-p})$,
where $p < 1/2$. Here, parameter $p$ is introduced to characterize the eigengap $\lambda_r - \lambda_{r+1}$: the
smaller the $p$, the larger the eigengap will be. We show that the approximation error bound
is improved to $O(N/m^{1-p})$ in the case of a large eigengap. The second case assumes that the
eigenvalues of $K$ follow a $p$-power law with $p > 1$. We show that the approximation error
is improved to $O(N/m^{p-1})$ provided that the eigenvector matrix satisfies an incoherence
assumption^3. This result explains why the Nystrom method works well for kernel matrices
with skewed eigenvalue distributions (Talwalkar and Rostamizadeh, 2010).

1. For completeness, we include a comparison to the relative error bound in (Gittens, 2011) in the
later remarks.

2. Although the main results in (Drineas and Mahoney, 2005) use a data dependent sampling scheme, it
was stated in the original paper that the results also hold for uniform sampling.

3. A similar assumption was used in the previous analysis of the Nystrom method (Talwalkar and
Rostamizadeh, 2010; Mackey et al., 2011; Gittens, 2011).

The second contribution of this study is a kernel classification algorithm that explicitly
explores the improved bounds of the Nystrom method developed here. We show that when
the eigenvalues of the kernel matrix follow a $p$-power law with $p > 1$, we can construct a
kernel classifier that yields a similar generalization performance as the full version of the kernel
classifier but with no more than $N^{2p/(p^2-1)}$ support vectors, which is sublinear in $N$ when
$p > 1 + \sqrt{2}$. Although the generalization error bound of using the Nystrom method for
classification has been studied in (Cortes et al., 2010), to the best of our knowledge, this is the
first work that bounds the number of support vectors using the analysis of the Nystrom
method.
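
To make the sublinearity claim concrete, the exponent $2p/(p^2-1)$ can be evaluated directly. The short Python sketch below is only an illustration (the function name `support_vector_exponent` is ours, not part of the analysis); it checks that the exponent equals 1 at $p = 1 + \sqrt{2}$ and falls below 1 for larger $p$.

```python
import math

def support_vector_exponent(p: float) -> float:
    """Exponent e such that the support-vector budget is N**e, i.e. e = 2p / (p^2 - 1)."""
    if p <= 1:
        raise ValueError("the p-power law analysis assumes p > 1")
    return 2.0 * p / (p * p - 1.0)

# p = 1 + sqrt(2) solves 2p = p^2 - 1, so the exponent is exactly 1 at the threshold.
threshold = 1.0 + math.sqrt(2.0)

for p in (2.0, threshold, 3.0, 4.0):
    e = support_vector_exponent(p)
    print(f"p = {p:.4f}: budget N^{e:.4f}")
```

For instance, $p = 3$ gives the exponent $6/8 = 0.75$, so roughly $N^{0.75}$ support vectors would suffice under the stated assumptions.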



2. Notations and Background 

Let $\mathcal{D} = \{x_1, \ldots, x_N\}$ be a collection of $N$ samples, where $x_i \in \mathcal{X}$, and $K = [\kappa(x_i, x_j)]_{N \times N}$
be the kernel matrix for the samples in $\mathcal{D}$, where $\kappa(\cdot, \cdot)$ is a kernel function. For simplicity, we
assume $\kappa(x, x) \le 1$ for any $x \in \mathcal{X}$. We denote by $(v_i, \lambda_i), i = 1, \ldots, N$ the eigenvectors and
eigenvalues of $K$ ranked in descending order of eigenvalues, and by $V = (v_1, \ldots, v_N)$
the orthonormal eigenvector matrix. In order to build the low rank approximation of the kernel
matrix $K$, the Nystrom method first samples $m < N$ examples randomly from $\mathcal{D}$, denoted
by $\widehat{\mathcal{D}} = \{\hat{x}_1, \ldots, \hat{x}_m\}$. Let $\widehat{K} = [\kappa(\hat{x}_i, \hat{x}_j)]_{m \times m}$ measure the kernel similarity between any
two samples in $\widehat{\mathcal{D}}$ and $K_b = [\kappa(x_i, \hat{x}_j)]_{N \times m}$ measure the similarity between the samples in
$\mathcal{D}$ and $\widehat{\mathcal{D}}$. Using the samples in $\widehat{\mathcal{D}}$, with rank $r$ set to $m$ (or the rank of $\widehat{K}$ if it is less
than $m$), the Nystrom method approximates $K$ by $K_b \widehat{K}^{\dagger} K_b^{\top}$, where $\widehat{K}^{\dagger}$ denotes the pseudo
inverse of $\widehat{K}$. Our goal is to provide a high probability bound for the approximation error
$\|K - K_b \widehat{K}^{\dagger} K_b^{\top}\|_2$. We choose $r = m$ (or the rank of $\widehat{K}$) because according to (Drineas and
Mahoney, 2005; Kumar et al., 2009), it yields the best approximation error for a non-singular
kernel matrix.
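
The construction above can be sketched numerically. The following Python example is a minimal illustration under our own assumptions: it uses NumPy, a Gaussian kernel (so that $\kappa(x, x) = 1$), synthetic data, and helper names (`rbf_kernel`, `nystrom`) that are ours. It forms $K_b \widehat{K}^{\dagger} K_b^{\top}$ from uniformly sampled columns and measures the error in the spectral norm.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Gaussian kernel matrix; kappa(x, x) = 1, consistent with kappa(x, x) <= 1."""
    sq_dists = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def nystrom(K, sample_idx):
    """Return K_b pinv(K_hat) K_b^T for the sampled column indices."""
    K_b = K[:, sample_idx]                      # N x m block of sampled columns
    K_hat = K[np.ix_(sample_idx, sample_idx)]   # m x m kernel among the samples
    return K_b @ np.linalg.pinv(K_hat) @ K_b.T

rng = np.random.default_rng(0)
N, m = 200, 20
X = rng.normal(size=(N, 3))                     # synthetic data, for illustration only
K = rbf_kernel(X, X)
idx = rng.choice(N, size=m, replace=False)      # uniform sampling without replacement
K_tilde = nystrom(K, idx)
err = np.linalg.norm(K - K_tilde, 2)            # spectral norm of the residual
print(f"spectral-norm error: {err:.4f}  (||K||_2 = {np.linalg.norm(K, 2):.4f})")
```

When $\widehat{K}$ is non-singular, the approximation reproduces the sampled columns of $K$ exactly, so the error comes entirely from the unsampled part.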

In this study, we focus on the spectral norm for measuring the approximation error, 
which is particularly suitable for kernel classification (Cortes et al., 2010). We also restrict 
the analysis to the uniform sampling for the Nystrom method. Although different sampling 
approaches have been suggested for the Nystrom method (Drineas and Mahoney, 2005; 
Kumar et al., 2009; Zhang et al., 2008; Belabbas and Wolfe, 2009), according to (Kumar 
et al., 2009), for real-world datasets, uniform sampling is the most efficient and yields
performance comparable to the other sampling approaches. We notice that in (Belabbas
and Wolfe, 2009), the authors show a significantly better approximation bound for the
Nystrom method when employing the determinantal process (Hough et al., 2006) for column
selection; however, it is important to point out that the determinantal process is usually
computationally expensive as it requires computing the determinant of the submatrix for the
selected columns/rows, making it unsuitable when a large number of columns
needs to be sampled.

Our analysis for the Nystrom method extensively exploits the properties of the integral 
operator. This is in contrast to most of the previous studies for the Nystrom method that 
rely on matrix analysis. The main advantage of using the integral operator is its convenience 
in handling the unseen data points (i.e., test data), making it attractive for the analysis of 
generalization error bounds. In particular, we introduce a linear operator $L_N$ defined over
the samples in $\mathcal{D}$. For any function $f(\cdot)$, operator $L_N$ is defined as

$$L_N[f](\cdot) = \frac{1}{N}\sum_{i=1}^{N} \kappa(x_i, \cdot) f(x_i).$$

It can be shown that the eigenvalues of the operator $L_N$ are $\lambda_i/N, i = 1, \ldots, N$ (Smale
and Zhou, 2009). Let $\varphi_1(\cdot), \ldots, \varphi_N(\cdot)$ be the corresponding eigenfunctions of $L_N$ that are
normalized by functional norm, i.e., $\langle \varphi_i, \varphi_j \rangle_{\mathcal{H}_\kappa} = \delta(i, j), 1 \le i \le j \le N$, where $\langle \cdot, \cdot \rangle_{\mathcal{H}_\kappa}$
denotes the inner product in $\mathcal{H}_\kappa$. According to (Smale and Zhou, 2009), the eigenfunctions
satisfy

$$\sqrt{\lambda_j}\,\varphi_j(\cdot) = \sum_{i=1}^{N} V_{i,j}\,\kappa(x_i, \cdot), \quad j = 1, \ldots, N, \qquad (2)$$

where $V_{i,j}$ is the $(i, j)$th element in $V$. Similarly, we can write $\kappa(x_j, \cdot)$ by its eigen-expansion
as

$$\kappa(x_j, \cdot) = \sum_{i=1}^{N} \sqrt{\lambda_i}\,V_{j,i}\,\varphi_i(\cdot), \quad j = 1, \ldots, N. \qquad (3)$$

Furthermore, let $L_m$ be an operator defined on the samples in $\widehat{\mathcal{D}}$, i.e.,

$$L_m[f](\cdot) = \frac{1}{m}\sum_{i=1}^{m} \kappa(\hat{x}_i, \cdot) f(\hat{x}_i).$$



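Finally we denote by $\langle f, g \rangle_{\mathcal{H}_\kappa}$ and $\|f\|_{\mathcal{H}_\kappa}$ the inner product and function norm in the Hilbert
space $\mathcal{H}_\kappa$ respectively, and denote by $\|L\|_{HS}$ and $\|L\|_2$ the Hilbert-Schmidt norm and spectral
norm of a linear operator $L$, respectively, i.e.

$$\|L\|_{HS} = \sqrt{\sum_{i,j} \langle \varphi_i, L\varphi_j \rangle_{\mathcal{H}_\kappa}^2} \quad \text{and} \quad \|L\|_2 = \max_{\|f\|_{\mathcal{H}_\kappa} \le 1} \|Lf\|_{\mathcal{H}_\kappa},$$

where $\{\varphi_i, i = 1, \ldots\}$ is a complete orthogonal basis of $\mathcal{H}_\kappa$. The two norms are the analogs
of the Frobenius and spectral norms in Euclidean space, respectively. In the following analysis,
omitted proofs are presented in the appendix.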

3. Approximation Error Bound by the Nystrom Method 

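Our first step is to turn $\|K - K_b \widehat{K}^{\dagger} K_b^{\top}\|_2$ into a functional approximation problem. To this
end, we introduce two sets:

$$\mathcal{H}_a = \mathrm{span}\left(\kappa(\hat{x}_1, \cdot), \ldots, \kappa(\hat{x}_m, \cdot)\right),$$

$$\mathcal{H}_b = \left\{ f(\cdot) = \sum_{i=1}^{N} u_i \kappa(x_i, \cdot) : \sum_{i=1}^{N} u_i^2 \le 1 \right\},$$

where $\mathcal{H}_a$ is the subspace spanned by kernel functions defined on the samples in $\widehat{\mathcal{D}}$, and $\mathcal{H}_b$
is a subset of a functional space spanned by kernel functions defined on the samples in $\mathcal{D}$
with bounded coefficients. Using the eigen-expansion of $\kappa(x_j, \cdot)$ in (3), it is straightforward
to show that $\mathcal{H}_b$ can be rewritten in the basis of the eigenfunctions $\{\varphi_i\}_{i=1}^{N}$:

$$\mathcal{H}_b = \left\{ f(\cdot) = \sum_{i=1}^{N} w_i \sqrt{\lambda_i}\,\varphi_i(\cdot) : \sum_{i=1}^{N} w_i^2 \le 1 \right\}.$$

Define $\mathcal{E}(g, \mathcal{H}_a)$ as the minimum error in approximating a function $g \in \mathcal{H}_b$ by functions
in $\mathcal{H}_a$, i.e.,

$$\mathcal{E}(g, \mathcal{H}_a) = \min_{f \in \mathcal{H}_a} \|f - g\|_{\mathcal{H}_\kappa}^2.$$

Define $\mathcal{E}(\mathcal{H}_a)$ as the worst error in approximating any function $g \in \mathcal{H}_b$ by functions in $\mathcal{H}_a$,
i.e.,

$$\mathcal{E}(\mathcal{H}_a) = \max_{g \in \mathcal{H}_b} \mathcal{E}(g, \mathcal{H}_a). \qquad (4)$$

The following proposition connects $\|K - K_b \widehat{K}^{\dagger} K_b^{\top}\|_2$ with $\mathcal{E}(\mathcal{H}_a)$.

Proposition 1 For any random samples $\hat{x}_1, \ldots, \hat{x}_m$, we have

$$\|K - K_b \widehat{K}^{\dagger} K_b^{\top}\|_2 = \mathcal{E}(\mathcal{H}_a).$$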
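Proof Since $g \in \mathcal{H}_b$ and $f \in \mathcal{H}_a$, we can write $g$ and $f$ as

$$g = \sum_{i=1}^{N} u_i \kappa(x_i, \cdot) \quad \text{and} \quad f = \sum_{i=1}^{m} z_i \kappa(\hat{x}_i, \cdot),$$

where $u = (u_1, \ldots, u_N)^{\top} \in \mathbb{R}^N$ satisfies $\|u\|_2 \le 1$ and $z = (z_1, \ldots, z_m)^{\top} \in \mathbb{R}^m$. We thus
can rewrite $\mathcal{E}(g, \mathcal{H}_a)$ as an optimization problem in terms of $z$, i.e.,

$$\mathcal{E}(g, \mathcal{H}_a) = \min_{z \in \mathbb{R}^m} z^{\top}\widehat{K}z - 2u^{\top}K_b z + u^{\top}Ku = u^{\top}\left(K - K_b \widehat{K}^{\dagger} K_b^{\top}\right)u,$$

and therefore

$$\mathcal{E}(\mathcal{H}_a) = \max_{g \in \mathcal{H}_b} \mathcal{E}(g, \mathcal{H}_a) = \max_{\|u\|_2 \le 1} u^{\top}\left(K - K_b \widehat{K}^{\dagger} K_b^{\top}\right)u = \|K - K_b \widehat{K}^{\dagger} K_b^{\top}\|_2.$$ ■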
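
The two steps of this proof can be checked numerically. The Python sketch below is an illustration under our own assumptions (NumPy, a Gaussian kernel on synthetic data, and a helper `objective` that we introduce): it verifies that the inner minimization is attained at $z = \widehat{K}^{\dagger}K_b^{\top}u$, and that the outer maximization over $\|u\|_2 \le 1$ returns the spectral norm of the residual matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
N, m = 60, 8
X = rng.normal(size=(N, 2))                      # synthetic data, illustration only
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq)                            # Gaussian kernel matrix
idx = rng.choice(N, size=m, replace=False)
K_b, K_hat = K[:, idx], K[np.ix_(idx, idx)]
R = K - K_b @ np.linalg.pinv(K_hat) @ K_b.T      # residual of the Nystrom approximation
R = 0.5 * (R + R.T)                              # remove round-off asymmetry

def objective(z, u):
    """||f - g||^2 for f = sum_i z_i kappa(x_hat_i, .) and g = sum_i u_i kappa(x_i, .)."""
    return z @ K_hat @ z - 2.0 * u @ K_b @ z + u @ K @ u

# Inner minimization over z for a fixed unit vector u.
u = rng.normal(size=N)
u /= np.linalg.norm(u)
z_star = np.linalg.pinv(K_hat) @ K_b.T @ u
inner_min = objective(z_star, u)
perturbed = objective(z_star + 0.1 * rng.normal(size=m), u)

# Outer maximization over ||u||_2 <= 1: attained at the top eigenvector of R.
eigvals, eigvecs = np.linalg.eigh(R)
u_top = eigvecs[:, -1]
outer_max = u_top @ R @ u_top

print(inner_min, u @ R @ u)              # the two values should agree
print(outer_max, np.linalg.norm(R, 2))   # the two values should agree
```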



Remark 2 We can restrict the space $\mathcal{H}_a$ to its subspace $\mathcal{H}_a^r = \left\{ \sum_{i=1}^{m} z_i \kappa(\hat{x}_i, \cdot) : z \in \mathrm{span}(\hat{v}_1, \ldots, \hat{v}_r) \right\}$,
where $\hat{v}_i, i = 1, \ldots, r$ are the first $r$ eigenvectors of $\widehat{K}$, to conduct the analysis for the rank
$r < m$ approximation of the Nystrom method.

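To proceed with our analysis, for any $r \in [N]$ we define

$$\mathcal{H}_r = \mathrm{span}(\varphi_1(\cdot), \ldots, \varphi_r(\cdot)),$$

$$\bar{\mathcal{H}}_r = \mathrm{span}(\varphi_{r+1}(\cdot), \ldots, \varphi_N(\cdot)),$$

$$\mathcal{H}_b^r = \left\{ f(\cdot) = \sum_{i=1}^{r} w_i \sqrt{\lambda_i}\,\varphi_i(\cdot) : \sum_{i=1}^{r} w_i^2 \le 1 \right\},$$

$$\bar{\mathcal{H}}_b^r = \left\{ f(\cdot) = \sum_{i=1}^{N-r} w_i \sqrt{\lambda_{i+r}}\,\varphi_{i+r}(\cdot) : \sum_{i=1}^{N-r} w_i^2 \le 1 \right\}.$$

Define $\mathcal{E}(\mathcal{H}_a, r) = \max_{g \in \mathcal{H}_b^r} \mathcal{E}(g, \mathcal{H}_a)$ as the worst error in approximating any function $g \in \mathcal{H}_b^r$
by functions in $\mathcal{H}_a$. The proposition below bounds $\mathcal{E}(\mathcal{H}_a)$ by $\mathcal{E}(\mathcal{H}_a, r)$.

Proposition 3 For any $r \in [N]$, we have

$$\mathcal{E}(\mathcal{H}_a) \le \max\left(\mathcal{E}(\mathcal{H}_a, r), \lambda_{r+1}\right) \le \mathcal{E}(\mathcal{H}_a, r) + \lambda_{r+1}.$$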

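Proof We first note that any $f \in \mathcal{H}_a$ can be written as $f = f_1 + f_2$, where $f_1 \in \mathcal{H}_a \cap \mathcal{H}_r$
and $f_2 \in \mathcal{H}_a \cap \bar{\mathcal{H}}_r$. For any $g \in \mathcal{H}_b$, we can write $g = g_1 + g_2$, where $g_1 \in \sqrt{1-\delta}\,\mathcal{H}_b^r$,
$g_2 \in \sqrt{\delta}\,\bar{\mathcal{H}}_b^r$, and $\delta \in [0, 1]$. Using these notations, we rewrite $\mathcal{E}(\mathcal{H}_a)$ as

$$\begin{aligned}
\mathcal{E}(\mathcal{H}_a) &= \max_{\substack{\delta \in [0,1] \\ g_1 \in \sqrt{1-\delta}\,\mathcal{H}_b^r,\ g_2 \in \sqrt{\delta}\,\bar{\mathcal{H}}_b^r}} \ \min_{\substack{f_1 \in \mathcal{H}_a \cap \mathcal{H}_r \\ f_2 \in \mathcal{H}_a \cap \bar{\mathcal{H}}_r}} \|f_1 - g_1\|_{\mathcal{H}_\kappa}^2 + \|f_2 - g_2\|_{\mathcal{H}_\kappa}^2 \\
&\le \max_{\delta \in [0,1]} \left( (1-\delta) \max_{g \in \mathcal{H}_b^r} \min_{f \in \mathcal{H}_a \cap \mathcal{H}_r} \|f - g\|_{\mathcal{H}_\kappa}^2 + \delta \max_{g \in \bar{\mathcal{H}}_b^r} \|g\|_{\mathcal{H}_\kappa}^2 \right) \\
&= \max_{\delta \in [0,1]} \left( (1-\delta) \max_{g \in \mathcal{H}_b^r} \min_{f \in \mathcal{H}_a} \|f - g\|_{\mathcal{H}_\kappa}^2 + \delta \max_{g \in \bar{\mathcal{H}}_b^r} \|g\|_{\mathcal{H}_\kappa}^2 \right) \\
&= \max_{\delta \in [0,1]} (1-\delta)\,\mathcal{E}(\mathcal{H}_a, r) + \delta \lambda_{r+1} = \max\left(\mathcal{E}(\mathcal{H}_a, r), \lambda_{r+1}\right),
\end{aligned}$$

where the second equality follows from the fact that for any $g \in \mathcal{H}_b^r$, $\min_{f \in \mathcal{H}_a \cap \mathcal{H}_r} \|f - g\|_{\mathcal{H}_\kappa} = \min_{f \in \mathcal{H}_a} \|f - g\|_{\mathcal{H}_\kappa}$,
and the last step follows from the definition of $\mathcal{E}(\mathcal{H}_a, r)$. ■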



As indicated by Proposition 3, in order to bound the approximation error $\mathcal{E}(\mathcal{H}_a)$, we
can bound $\mathcal{E}(\mathcal{H}_a, r)$, namely the approximation error for functions in the subspace spanned
by the top eigenfunctions of $L_N$. In the next two subsections, we discuss two approaches
for bounding $\mathcal{E}(\mathcal{H}_a, r)$: the first approach relies on the concentration inequality of integral
operators (Smale and Zhou, 2009), and the second approach explores the compressive sensing
theory (Candes and Romberg, 2007). Before proceeding to upper bound $\mathcal{E}(\mathcal{H}_a)$, we first
provide a lower bound for $\mathcal{E}(\mathcal{H}_a)$.

Theorem 4 There exists a kernel matrix $K \in \mathbb{R}^{N \times N}$ with all its diagonal entries being 1
such that for any sampling strategy that selects $m$ columns, the approximation error of the
Nystrom method is lower bounded by $\Omega(N/m)$, i.e.,

$$\|K - K_b \widehat{K}^{\dagger} K_b^{\top}\|_2 \ge \Omega\left(\frac{N}{m}\right),$$

provided $N > 64[\ln 4]^2 m^2$.



Remark 5 Theorem 4 shows that the lower bound for the approximation error of the
Nystrom method is $\Omega(N/m)$. The analysis developed in this work aims to bridge the gap
between the known upper bound (i.e., $O(N/\sqrt{m})$) and the obtained lower bound.



3.1 Bound for $\mathcal{E}(\mathcal{H}_a, r)$ using Concentration Inequality of Integral Operator

In this section, we bound $\mathcal{E}(\mathcal{H}_a, r)$ using the concentration inequality of integral
operators. We show that the approximation error of the Nystrom method can be improved to
$O(N/m^{1-p})$ when there is a large eigengap in the spectrum of the kernel matrix $K$, where
$p < 1/2$ is introduced to characterize the eigengap. We first state the concentration
inequality of a general random variable.

Proposition 6 (Proposition 1 (Smale and Zhou, 2009)) Let $\xi$ be a random variable on
$(\mathcal{X}, P_{\mathcal{X}})$ with values in a Hilbert space $(\mathcal{H}, \|\cdot\|)$. Assume $\|\xi\| \le M < \infty$ almost surely.
Then with a probability at least $1 - \delta$, we have

$$\left\| \frac{1}{m}\sum_{i=1}^{m} \left[\xi(x_i) - \mathrm{E}[\xi]\right] \right\| \le \frac{4M\ln(2/\delta)}{\sqrt{m}}.$$



The approximation error of the Nystrom method using the concentration inequality is 
given in the following theorem. 

Theorem 7 With a probability at least $1 - \delta$, for any $r \in [N]$, we have

$$\|K - K_b \widehat{K}^{\dagger} K_b^{\top}\|_2 \le \frac{16[\ln(2/\delta)]^2 N^2}{m\lambda_r} + \lambda_{r+1}.$$



We consider the scenario where there is a very large eigengap in the spectrum of the kernel
matrix $K$. In particular, we assume that there exists a rank $r$ and $p \in (0, 1/2)$ such that
$\lambda_r = \Omega(N/m^{p})$ and $\lambda_{r+1} = O(N/m^{1-p})$. Parameter $p$ is introduced to characterize the
eigengap, which is given by

$$\lambda_r - \lambda_{r+1} = \Omega\left(\frac{N}{m^{p}}\left[1 - \frac{1}{m^{1-2p}}\right]\right).$$

Evidently, the smaller the $p$, the larger the eigengap. When $p = 1/2$, the eigengap is small.
Under the large eigengap assumption, the first term in Theorem 7 satisfies
$N^2/(m\lambda_r) = O(N/m^{1-p})$ because $\lambda_r = \Omega(N/m^{p})$, and the bound is simplified as

$$\|K - K_b \widehat{K}^{\dagger} K_b^{\top}\|_2 \le O\left(\frac{N}{m^{1-p}}\right). \qquad (5)$$

Compared to the bound in (1), the bound in (5) improves the approximation error from
$O(N/\sqrt{m})$ to $O(N/m^{1-p})$ when $p < 1/2$.

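To prove Theorem 7, we define two sets of functions

$$\mathcal{H}_c^r = \left\{ h = \sum_{i=1}^{r} w_i \varphi_i(\cdot) : \frac{1}{N^2}\sum_{i=1}^{r} w_i^2 \lambda_i \le 1 \right\},$$

$$\mathcal{H}_d^r = \left\{ f \in \mathcal{H}_\kappa : \|f\|_{\mathcal{H}_\kappa}^2 \le N^2/\lambda_r \right\},$$

where $r$ corresponds to the rank with a large eigengap. It is evident that $\mathcal{H}_c^r \subseteq \mathcal{H}_d^r$, and for
any $g \in \mathcal{H}_b^r$, it can also be written as $g = L_N[h]$, where $h \in \mathcal{H}_c^r$.
Using $\mathcal{H}_c^r$ and $\mathcal{H}_d^r$, we have

$$\mathcal{E}(\mathcal{H}_a, r) = \max_{g \in \mathcal{H}_b^r} \mathcal{E}(g, \mathcal{H}_a) = \max_{h \in \mathcal{H}_c^r} \min_{f \in \mathcal{H}_a} \|L_N h - f\|_{\mathcal{H}_\kappa}^2 \le \max_{h \in \mathcal{H}_d^r} \min_{f \in \mathcal{H}_a} \|L_N h - f\|_{\mathcal{H}_\kappa}^2.$$

By constructing $f$ as $L_m[h]$ we can bound $\mathcal{E}(\mathcal{H}_a, r)$ as

$$\begin{aligned}
\mathcal{E}(\mathcal{H}_a, r) &\le \max_{h \in \mathcal{H}_d^r} \min_{f \in \mathcal{H}_a} \|L_N[h] - f\|_{\mathcal{H}_\kappa}^2 \\
&\le \max_{h \in \mathcal{H}_d^r} \|(L_N - L_m)h\|_{\mathcal{H}_\kappa}^2 \\
&\le \|L_N - L_m\|_2^2 \,\frac{N^2}{\lambda_r} \\
&\le \|L_N - L_m\|_{HS}^2 \,\frac{N^2}{\lambda_r}, \qquad (6)
\end{aligned}$$

where the last step follows from the fact $\|L_N - L_m\|_2 \le \|L_N - L_m\|_{HS}$. The following
corollary allows us to bound the difference between $L_N$ and $L_m$ and follows immediately from
Proposition 6.

Corollary 8 With a probability $1 - \delta$, we have

$$\|L_N - L_m\|_{HS} \le \frac{4\ln(2/\delta)}{\sqrt{m}}.$$

Finally, Theorem 7 follows directly from the inequality in (6) and the result in Corollary 8.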
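
To see how the two terms of Theorem 7 trade off as $r$ varies, the bound can be evaluated next to the actual error on a toy problem. The Python sketch below is an illustration only: it assumes NumPy, a Gaussian kernel, and synthetic two-cluster data chosen by us, and the function name `theorem7_bound` is ours. At such small sizes the bound is far from tight (its first term alone exceeds $N$ whenever $m \le N$, since $\lambda_r \le N$), so the comparison is qualitative.

```python
import numpy as np

def theorem7_bound(eigvals_desc, m, r, delta):
    """16 [ln(2/delta)]^2 N^2 / (m * lambda_r) + lambda_{r+1}, for a 1-based rank r < N."""
    N = len(eigvals_desc)
    lam_r, lam_r1 = eigvals_desc[r - 1], eigvals_desc[r]
    return 16.0 * np.log(2.0 / delta) ** 2 * N**2 / (m * lam_r) + lam_r1

rng = np.random.default_rng(2)
N, m, delta = 300, 30, 0.1
# Two well-separated clusters: an illustrative way to obtain a gap in the spectrum.
X = np.vstack([rng.normal(-5.0, 0.3, size=(N // 2, 2)),
               rng.normal(5.0, 0.3, size=(N // 2, 2))])
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq)
lam = np.sort(np.linalg.eigvalsh(K))[::-1]      # eigenvalues in descending order

idx = rng.choice(N, size=m, replace=False)      # uniform column sampling
K_b, K_hat = K[:, idx], K[np.ix_(idx, idx)]
err = np.linalg.norm(K - K_b @ np.linalg.pinv(K_hat) @ K_b.T, 2)

bounds = {r: theorem7_bound(lam, m, r, delta) for r in (1, 2, 5)}
for r, b in bounds.items():
    print(f"r = {r}: bound = {b:.2f}, actual error = {err:.4f}")
```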



3.2 Bound for $\mathcal{E}(\mathcal{H}_a, r)$ using Compressive Sensing Theory

In this subsection, we aim to develop a better error bound for the Nystrom method for
kernel matrices with eigenvalues that follow a power law distribution. Our analysis
explicitly explores some of the key results in the theory of compressive sensing (Candes and
Romberg, 2007; Donoho, 2006). To this end, we first introduce the definition of the power
law distribution of eigenvalues (Koltchinskii and Yuan, 2010; Kloft and Blanchard, 2011).
The eigenvalues $\sigma_i, i = 1, \ldots$ ranked in non-increasing order follow a $p$-power law
(distribution) if there exists a constant $c > 0$ such that

$$\sigma_k \le c k^{-p}.$$

In the sequel, we assume the normalized eigenvalues $\lambda_i/N, i = 1, \ldots, N$ (i.e., the
eigenvalues of the operator $L_N$), follow a $p$-power law distribution^4. A well-known example of
a kernel with a power law eigenvalue distribution (Koltchinskii and Yuan, 2010) is the kernel
function that generates Sobolev Spaces $W^{\alpha,2}(\mathbb{T}^d)$ of smoothness $\alpha > d/2$, where $\mathbb{T}^d$ is the
$d$-dimensional torus. Its eigenvalues follow a $p$-power law with $p = 2\alpha > d$. It is also observed
that the eigenvalues of a Gaussian kernel, with the width parameter set appropriately,
follow a power law distribution (Ji et al., 2012).

4. We assume a power law distribution for the normalized eigenvalues $\lambda_i/N$ because the eigenvalues $\lambda_i$ of $K$
scale in $N$.

In order to exploit the compressive sensing theory (Candes and Romberg, 2007), we
introduce the definition of the coherence $\mu$ for the eigenvector matrix $V = (v_1, \ldots, v_N)$
as

$$\mu = \sqrt{N} \max_{1 \le i, j \le N} |V_{i,j}|.$$

Intuitively, the coherence measures the degree to which the eigenvectors in $V$ are correlated
with the canonical bases. According to the theory of compressive sensing, highly coherent
matrices are difficult (even impossible) to recover by matrix completion with random
sampling. As observed in previous studies (Talwalkar and Rostamizadeh, 2010) and seen
later in our analysis, the coherence of $V$ also plays an important role in measuring the
approximation performance of the Nystrom method using uniform sampling.

The coherence measure was first introduced into the error analysis of the Nystrom
method by Talwalkar and Rostamizadeh (2010). Their analysis shows that a low rank kernel
matrix with incoherent eigenvectors (i.e., with low coherence) can be accurately
approximated by the Nystrom method using uniform sampling.
This result is generalized to noisy observations in (Mackey et al., 2011) for low rank matrices.
The main limitation of these results is that they only apply to low rank matrices. Recently,
Gittens (2011) developed a relative error bound of the Nystrom method for
kernel matrices with an arbitrary rank using a slightly different coherence measure. Unlike
the previous studies, we focus on the error bound of the Nystrom method for kernel
matrices with an arbitrary rank and a skewed eigenvalue distribution. The main result of our
analysis is given in the following theorem.

Theorem 9 Assume the eigenvalues $\lambda_i/N, i = 1, \ldots, N$ follow a $p$-power law with $p > 1$.
Given a sufficiently large number of samples $m$, we have, with a probability $1 - 2N^{-3}$,

$$\|K - K_b \widehat{K}^{\dagger} K_b^{\top}\|_2 \le \tilde{O}\left(\frac{N}{m^{p-1}}\right),$$

where $\tilde{O}(\cdot)$ suppresses the polynomial factor that depends on $\ln N$, and $C_{ab}$ is a numerical
constant as revealed in our later analysis.

Remark 10 Compared to the approximation error in (1), Theorem 9 improves the bound
from $O(N/\sqrt{m})$ to $O(N/m^{p-1})$ provided the eigenvalues of the kernel matrix follow a power law.
For the relative error bound given in (Gittens, 2011), the approximation error is dominated
by $O(N^2/m^{p+1})$ for eigenvalues following a $p$-power law. It is straightforward to see that
the result in Theorem 9 is better than $O(N^2/m^{p+1})$ when $m \le \sqrt{N}$, a favorable setting
when $N$ is very large and $m$ is small. Finally it is worth noting that similar to (Talwalkar
and Rostamizadeh, 2010; Mackey et al., 2011; Gittens, 2011), the bound in Theorem 9 is
meaningful only when the coherence $\mu$ of the eigenvector matrix is small (i.e., the eigenvector
matrix satisfies the incoherence assumption).

We emphasize that the result in Theorem 9 does not contradict the lower bound given 
in Theorem 4 because Theorem 9 holds only for the cases when eigenvalues of the kernel 
matrix follow a power law. In fact, an updated lower bound for kernel matrix with a skewed 
eigenvalue distribution is given in the following theorem. 

Theorem 11 There exists a kernel matrix $K \in \mathbb{R}^{N \times N}$ with all its diagonal entries being 1
and its eigenvalues following a $p$-power law such that for any sampling strategy that selects
$m$ columns, the approximation error of the Nystrom method is lower bounded by $\Omega(N/m^{p})$,
i.e.,

$$\|K - K_b \widehat{K}^{\dagger} K_b^{\top}\|_2 \ge \Omega\left(\frac{N}{m^{p}}\right),$$

provided $N > 64[\ln 4]^2 m^2$.



We skip the proof of this theorem as it is almost identical to that of Theorem 4. The gap
between the upper bound and the lower bound given in Theorems 9 and 11 indicates that
there is potentially room for further improvement.

Next, we present several theorems and corollaries to pave the path for the proof of
Theorem 9. We borrow the following two theorems from the compressive sensing theory (Candes
and Romberg, 2007) that are the key to our analysis.

Theorem 12 (Theorem 1.2 from (Candes and Romberg, 2007)) Let $V$ be an $N \times N$ orthogonal
matrix ($V^{\top}V = I$) with coherence $\mu$. Fix a subset $T$ of the signal domain. Choose a
subset $S$ of the measurement domain of size $|S| = m$ uniformly at random. Suppose that the
number of measurements $m$ obeys $m \ge |T|\mu^2 \max\left(C_a \ln|T|, C_b \ln(3/\delta)\right)$ for some positive
constants $C_a$ and $C_b$. Then

$$\Pr\left( \left\| \frac{N}{m} V_{S,T}^{\top} V_{S,T} - I \right\|_2 \ge 1/2 \right) \le \delta.$$
Theorem 13 (Lemma 3.3 from (Candes and Romberg, 2007)) Let $V$, $S$, and $T$ be the same
as defined in Theorem 12. Let $u_k$ be the $k$-th row of $V_{S,T^c}^{\top}V_{S,T}$. Define $\sigma^2 = \mu^2 m \max\left(1, \mu|T|/\sqrt{m}\right)$.
Fix $a > 0$ obeying $a \le (m/\mu^2)^{1/4}$ if $\mu|T|/\sqrt{m} > 1$ and $a \le (m/(\mu^2|T|))^{1/2}$ otherwise. Let
$z_k = (V_{S,T}^{\top}V_{S,T})^{-1}u_k$. Then, we have

$$\Pr\left( \sup_{k \in T^c} \|z_k\|_2 \ge 2\mu\sqrt{|T|/m} + 2a\sigma/m \right) \le N\exp(-\gamma a^2) + \Pr\left( \|V_{S,T}^{\top}V_{S,T}\|_2 \le \frac{m}{2N} \right)$$

for some positive constant $\gamma$, where $T^c$ stands for the complementary set to $T$.

Combining the results from Theorem 12 and Theorem 13, we have the following high
probability bound for $\sup_{k \in T^c} \|z_k\|_2$.

Corollary 14 If $|T| \ge \max\left(C_{ab}\ln(3N^3), 4\ln N/\gamma\right)$, and

$$\mu^2 \max\left( |T| C_{ab} \ln(3N^3),\ 16\left(\frac{\ln N}{\gamma}\right)^2 \right) \le m \le \mu^2 |T|^2,$$

where $C_{ab} = \max(C_a, C_b)$, then with a probability $1 - 2N^{-3}$, we have

$$\sup_{k \in T^c} \|z_k\|_2 \le 4\mu\sqrt{\frac{|T|}{m}}.$$

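Using Corollary 14, we have the following bound for $\mathcal{E}(\mathcal{H}_a, r)$.

Theorem 15 If $r \ge \max\left(C_{ab}\ln(3N^3), 4\ln N/\gamma\right)$ and

$$\mu^2 \max\left( r C_{ab} \ln(3N^3),\ 16\left(\frac{\ln N}{\gamma}\right)^2 \right) \le m \le \mu^2 r^2,$$

then, with a probability $1 - 2N^{-3}$, we have

$$\mathcal{E}(\mathcal{H}_a, r) \le \frac{16\mu^2 r}{m} \sum_{i=r+1}^{N} \lambda_i.$$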



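Proof For the sake of simplicity, we assume that the first $m$ examples are sampled, i.e.,
$\widehat{\mathcal{D}} = \{x_1, \ldots, x_m\}$. For any $g \in \mathcal{H}_b^r$, we have $g(\cdot) = \sum_{i=1}^{r} w_i \lambda_i^{1/2} \varphi_i(\cdot)$, with $\sum_{i=1}^{r} w_i^2 \le 1$.
Below, we make a specific construction of $f$ based on $g$ that ensures a small approximation
error. Let $f$ be

$$f(\cdot) = \sum_{j=1}^{m} a_j \kappa(x_j, \cdot) = \sum_{i=1}^{N} \lambda_i^{1/2} \varphi_i(\cdot) \left( \sum_{j=1}^{m} a_j V_{j,i} \right) = \sum_{i=1}^{N} b_i \lambda_i^{1/2} \varphi_i(\cdot),$$

where $b_i = \sum_{j=1}^{m} a_j V_{j,i}, i = 1, \ldots, N$, and the value of $a = (a_1, \ldots, a_m)^{\top}$ will be given later.
Define $T = \{1, \ldots, r\}$ and $S = \{1, \ldots, m\}$. Under the condition that

$$m \ge r\mu^2 \max(C_a, C_b) \ln(3N^3) \ge r\mu^2 \max\left(C_a \ln r, C_b \ln(3N^3)\right),$$

Theorem 12 holds, and therefore with a probability at least $1 - N^{-3}$,

$$\frac{m}{2N} \le \lambda_{\min}\left(V_{S,T}^{\top}V_{S,T}\right) \le \lambda_{\max}\left(V_{S,T}^{\top}V_{S,T}\right) \le \frac{3m}{2N}. \qquad (7)$$

We construct $a$ as $a = V_{S,T}\left(V_{S,T}^{\top}V_{S,T}\right)^{-1} w$, where $w = (w_1, \ldots, w_r)^{\top}$. Since

$$b = V_{S,\cdot}^{\top} a = V_{S,\cdot}^{\top} V_{S,T} \left(V_{S,T}^{\top}V_{S,T}\right)^{-1} w,$$

where $b = (b_1, \ldots, b_N)^{\top}$ and $V_{S,\cdot}$ denotes the rows of $V$ indexed by $S$, it is straightforward
to see that $b_j = w_j$ for $j \in T$. Using the
result from Corollary 14, we have, with a probability at least $1 - 2N^{-3}$,

$$\max_{j \in T^c} |b_j| \le \max_{j \in T^c} \|z_j\|_2 \|w\|_2 \le 4\mu\sqrt{\frac{r}{m}},$$

where $z_j$ is the $j$-th row of the matrix $V_{S,T^c}^{\top}V_{S,T}\left(V_{S,T}^{\top}V_{S,T}\right)^{-1}$. We thus obtain

$$\|f - g\|_{\mathcal{H}_\kappa}^2 = \left\| \sum_{i \in T^c} \lambda_i^{1/2} b_i \varphi_i(\cdot) \right\|_{\mathcal{H}_\kappa}^2 \le \frac{16\mu^2 r}{m} \sum_{i=r+1}^{N} \lambda_i.$$

Hence,

$$\mathcal{E}(\mathcal{H}_a, r) = \max_{g \in \mathcal{H}_b^r} \min_{f \in \mathcal{H}_a} \|f - g\|_{\mathcal{H}_\kappa}^2 \le \frac{16\mu^2 r}{m} \sum_{i=r+1}^{N} \lambda_i.$$ ■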
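
The key step of this proof, that the constructed coefficients satisfy $b_j = w_j$ on $T$ while the remaining entries are controlled by the rows $z_j$, can be checked numerically. The Python sketch below is an illustration under our own assumptions: it uses NumPy and a random orthogonal matrix standing in for the eigenvector matrix $V$ (real kernel eigenvectors may be far more coherent), and all variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(3)
N, m, r = 50, 20, 5
# Random orthogonal matrix standing in for the eigenvector matrix V (illustration only).
V, _ = np.linalg.qr(rng.normal(size=(N, N)))
S = np.arange(m)         # sampled rows: the first m examples
T = np.arange(r)         # indices of the top-r eigenvectors
T_c = np.arange(r, N)    # complement of T

w = rng.normal(size=r)
w /= np.linalg.norm(w)                     # ||w||_2 = 1

V_ST = V[np.ix_(S, T)]                     # m x r block
G = V_ST.T @ V_ST                          # r x r Gram matrix V_{S,T}^T V_{S,T}
a = V_ST @ np.linalg.solve(G, w)           # a = V_{S,T} G^{-1} w
b = V[S, :].T @ a                          # b_i = sum_{j in S} a_j V_{j,i}

# Rows of Z are z_j^T for j in T^c, so b restricted to T^c equals Z w.
Z = V[np.ix_(S, T_c)].T @ V_ST @ np.linalg.inv(G)
max_row_norm = np.linalg.norm(Z, axis=1).max()

print("max |b_j - w_j| on T:", np.abs(b[T] - w).max())
print("max |b_j| on T^c:", np.abs(b[T_c]).max(), "<= max ||z_j||_2 =", max_row_norm)
```

The last printed inequality is the Cauchy-Schwarz step $|b_j| \le \|z_j\|_2 \|w\|_2$ used in the proof.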



Remark 16 It is worthwhile to compare the result in Theorem 15, i.e., $\mathcal{E}(\mathcal{H}_a, r) = O\left(\frac{\mu^2 r}{m}\sum_{i=r+1}^{N}\lambda_i\right)$,
to the relative error bound given in (Gittens, 2011), i.e., $\mathcal{E}(\mathcal{H}_a, r) \le O(\lambda_{r+1}N/m)$. In
the case when the eigenvalues decay fast (e.g., eigenvalues follow a power law), we have
$\sum_{i=r+1}^{N} \lambda_i \ll N\lambda_{r+1}$, and therefore our bound is significantly better than the relative bound
in (Gittens, 2011). On the other hand, when eigenvalues follow a flat distribution (e.g.,
$\lambda_i \approx \lambda_{r+1}$ for all $i \in [r+2, N]$), we have $\sum_{i=r+1}^{N} \lambda_i \approx N\lambda_{r+1}$, and therefore our bound is
worse than the relative bound in (Gittens, 2011) by a factor of $\mu^2 r$.

Finally, we show the proof of Theorem 9 using Theorem 15. 



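Proof [Proof of Theorem 9] Let $r = \left\lfloor \frac{m}{\mu^2 C_{ab}\ln(3N^3)} \right\rfloor$, then

$$\mu^2 r C_{ab} \ln(3N^3) \le m \le \mu^2 r^2,$$

where the right inequality follows from $r \ge \frac{m}{2\mu^2 C_{ab}\ln(3N^3)}$ and $m \ge 4\mu^2 C_{ab}^2 \ln^2(3N^3)$.
Then the conditions in Theorem 15 hold and we have

$$\|K - K_b \widehat{K}^{\dagger} K_b^{\top}\|_2 \le \max\left(\mathcal{E}(\mathcal{H}_a, r), \lambda_{r+1}\right) \le \max\left(\frac{16\mu^2 r}{m}, 1\right) \sum_{i=r+1}^{N} \lambda_i.$$

Since $\max(16\mu^2 r/m, 1) \le O(1)$ due to the specific value we choose for $r$, and $\sum_{i=r+1}^{N} \lambda_i = O(N/r^{p-1})$
due to the power law distribution (because $\sum_{i > r} i^{-p} \le r^{1-p}/(p-1)$ for $p > 1$), then

$$\|K - K_b \widehat{K}^{\dagger} K_b^{\top}\|_2 \le O\left(\frac{N}{r^{p-1}}\right) \le \tilde{O}\left(\frac{N}{m^{p-1}}\right).$$ ■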
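
The tail estimate used in the last step of this proof follows from an integral comparison, which the short Python sketch below illustrates. It is only a sanity check under our own assumptions: the eigenvalues are set to the largest values allowed by the power law, $\lambda_i = N c\, i^{-p}$, and the function names are ours.

```python
def power_law_tail(N: int, r: int, p: float, c: float = 1.0) -> float:
    """Sum_{i=r+1}^{N} N * c * i^(-p): the largest tail allowed when lambda_i / N <= c * i^(-p)."""
    return sum(N * c * i ** (-p) for i in range(r + 1, N + 1))

def integral_bound(N: int, r: int, p: float, c: float = 1.0) -> float:
    """Integral comparison: sum_{i>r} i^(-p) <= r^(1-p) / (p - 1) for p > 1."""
    return N * c * r ** (1.0 - p) / (p - 1.0)

N, p = 10_000, 2.0
for r in (10, 100, 1000):
    print(r, power_law_tail(N, r, p), integral_bound(N, r, p))
```

Both quantities scale as $N/r^{p-1}$, which is the rate that enters the final bound once $r$ is chosen proportional to $m$ up to logarithmic factors.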



4. Application of the Nystrom Method to Kernel Classification 

Although the Nystrom method was proposed in (Williams and Seeger, 2001) to speed up
kernel machines, few studies examine the application of the Nystrom method to kernel
classification. In fact, to the best of our knowledge, (Williams and Seeger, 2001) and (Cortes
et al., 2010) are the only two pieces of work that explicitly explore the Nystrom method for
kernel classification. The key idea of both works is to apply the Nystrom method to
approximate the kernel matrix with a low rank matrix in order to reduce the computational cost.
More specifically, we consider the following optimization problem for kernel classification

$$\min_{f \in \mathcal{H}_\kappa} \mathcal{L}_N(f) = \frac{\lambda}{2}\|f\|_{\mathcal{H}_\kappa}^2 + \frac{1}{N}\sum_{i=1}^{N} \ell(y_i f(x_i)), \qquad (8)$$

where $y_i \in \{-1, +1\}$ is the class label assigned to instance $x_i$, and $\ell(z)$ is a convex loss
function. To facilitate our analysis, we assume (i) $\ell(z)$ is strongly convex with modulus $\sigma$,
i.e. $|\ell''(z)| \ge \sigma$,^5 and (ii) $\ell(z)$ is Lipschitz continuous, i.e. $|\ell'(z)| \le C$ for any $z$ within
the domain. Using the convex conjugate of the loss function $\ell(z)$, denoted by $\ell_*(\alpha), \alpha \in \Omega$,
where $\Omega$ is the domain for the dual variable $\alpha$, we can cast the problem in (8) into the following
optimization problem over $\alpha$

$$\max_{\alpha \in \Omega^N} \ -\frac{1}{N}\sum_{i=1}^{N} \ell_*(\alpha_i) - \frac{1}{2\lambda N^2}(\alpha \circ y)^{\top} K (\alpha \circ y), \qquad (9)$$

with the solution $f$ given by $f = -\frac{1}{\lambda N}\sum_{i=1}^{N} \alpha_i y_i \kappa(x_i, \cdot)$. By the Fenchel conjugate theory,
we have $\max_{\alpha \in \Omega} |\alpha|^2 \le C^2$, because $|\ell'(z)| \le C$.

5. Loss functions such as the square loss used for regression and the logit function used for logistic regression are
strongly convex.
To reduce the computational cost, Williams and Seeger (2001) and Cortes et al. (2010)
suggest replacing the kernel matrix $K$ with its low rank approximation $\widetilde{K} = K_b \widehat{K}^{\dagger} K_b^{\top}$,
leading to the following optimization problem for $\alpha$

$$\max_{\alpha \in \Omega^N} \ -\frac{1}{N}\sum_{i=1}^{N} \ell_*(\alpha_i) - \frac{1}{2\lambda N^2}(\alpha \circ y)^{\top} \widetilde{K} (\alpha \circ y). \qquad (10)$$

One main problem with this approach is that although it simplifies the computation of 
kernel matrix, it does not simplify the classifier /, because the number of support vectors, 
after the application of the Nystrom method, is not guaranteed to be small (Dekel and 
Singer, 2006; Joachims and Yu, 2009), leading to a high computational cost in performing 
function evaluation. 

We address this difficulty by presenting a new approach to explore the Nystrom method
for kernel classification. Similar to the previous analysis, we randomly select a subset of
training examples, denoted by $\widehat{\mathcal{D}} = (\hat{x}_1, \ldots, \hat{x}_m)$, and restrict the solution of $f(\cdot)$ to the
subspace $\mathcal{H}_a = \mathrm{span}(\kappa(\hat{x}_1, \cdot), \ldots, \kappa(\hat{x}_m, \cdot))$, leading to the following optimization problem

$$\min_{f \in \mathcal{H}_a} \mathcal{L}_N(f) = \frac{\lambda}{2}\|f\|_{\mathcal{H}_\kappa}^2 + \frac{1}{N}\sum_{i=1}^{N} \ell(y_i f(x_i)). \qquad (11)$$

The following proposition shows that the optimal solution to (11) is closely related to the
optimal solution to (10).

Proposition 17 The solution $f$ to (11) is given by

$$f = -\frac{1}{\lambda N}\sum_{i=1}^{m} z_i \kappa(\hat{x}_i, \cdot),$$

where $z = \widehat{K}^{\dagger} K_b^{\top} (\alpha \circ y)$ and $\alpha$ is the optimal solution to (10).

It is important to note that the classifier obtained from (11) is only supported by the
sampled training examples in $\widehat{\mathcal{D}}$, which significantly reduces the complexity of the kernel
classifier compared to the approach suggested in (Williams and Seeger, 2001; Cortes et al.,
2010). We also note that the proposed approach is equivalent to learning a linear classifier
by representing each instance $x$ with the vector

$$\phi(x) = D^{-1/2} \widehat{V}^{\top} \left(\kappa(\hat{x}_1, x), \ldots, \kappa(\hat{x}_m, x)\right)^{\top},$$

where $D$ is a diagonal matrix with the non-zero eigenvalues of $\widehat{K}$, and $\widehat{V}$ is the corresponding
eigenvector matrix. Although this idea has already been adopted by practitioners, we are
unable to find any reference on its empirical study. The remainder of this work shows
that this approach can have good generalization performance provided that the
eigenvalues of the kernel matrix follow a skewed distribution. Below, we develop the generalization
error bound for the classifier learned from (11).
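
The equivalence with a linear classifier can be made concrete: inner products of the feature vectors $\phi(x)$ reproduce the Nystrom approximation $K_b \widehat{K}^{\dagger} K_b^{\top}$ on the training set. The Python sketch below is a minimal illustration under our own assumptions (NumPy, a Gaussian kernel, synthetic data, and a helper `nystrom_features` with a relative tolerance `tol` for deciding which eigenvalues count as non-zero).

```python
import numpy as np

def nystrom_features(K_hat, K_cols, tol=1e-10):
    """Rows are phi(x)^T = (D^{-1/2} V_hat^T k(x))^T with k(x) = (kappa(x_hat_1, x), ..., kappa(x_hat_m, x))."""
    eigvals, eigvecs = np.linalg.eigh(K_hat)
    keep = eigvals > tol * eigvals.max()          # keep the non-zero eigenvalues only
    D_inv_sqrt = 1.0 / np.sqrt(eigvals[keep])
    return (K_cols @ eigvecs[:, keep]) * D_inv_sqrt   # shape: (n_points, rank)

rng = np.random.default_rng(4)
N, m = 100, 10
X = rng.normal(size=(N, 2))                       # synthetic data, illustration only
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq)
idx = rng.choice(N, size=m, replace=False)
K_b, K_hat = K[:, idx], K[np.ix_(idx, idx)]

Phi = nystrom_features(K_hat, K_b)                # N x rank feature matrix
K_tilde = K_b @ np.linalg.pinv(K_hat, rcond=1e-10) @ K_b.T
print("max |Phi Phi^T - K_tilde| =", np.abs(Phi @ Phi.T - K_tilde).max())
```

Any linear learner trained on the rows of `Phi` then yields a classifier supported only on the $m$ sampled examples, which is the practical appeal of this formulation.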

Let $f_N$ and $f_N^a$ be the optimal solutions to (8) and (11), respectively. Let $f^*$ be the
optimal classifier that minimizes the expected loss function, i.e.,

$$f^* = \arg\min_{f \in \mathcal{H}_\kappa} P(\ell \circ f) \triangleq \mathrm{E}_{(x,y)}\left[\ell(y f(x))\right].$$

Let $\|f\|_{L_2}^2 = \mathrm{E}_x[|f(x)|^2]$ denote the squared $L_2$ norm of $f$. In order to create a tight bound,
we exploit the technique of local Rademacher complexity (Bartlett et al., 2002; Koltchinskii,
2011). Define $\psi(\cdot)$ as



(2 - 

V i=i 



1/2 



Let $\tilde{\epsilon}$ be the solution to $\tilde{\epsilon}^2 = \psi(\tilde{\epsilon})$, where the existence and uniqueness of $\tilde{\epsilon}$ is determined
by the sub-root property of $\psi(\delta)$ (Bartlett et al., 2002). Finally we define



/ /61niV\ , x 

e = max I e, yj I . (12) 

Theorem 18 Assume that, with a probability $1 - 2N^{-3}$, $\mathcal{E}(\mathcal{H}_a) \le \Gamma(N, m)$, where $\Gamma(N, m)$ is
some function depending on $N$ and $m$. Assume that $N$ is sufficiently large such that

^i\\fMn K ,\\r\\nJ<^^, 



max(||/^|| L2 ,||r|| L2 )< 6 



2 V 61n7V 

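Then, with a probability at least $1 - 4N^{-3}$, we have

$$P(\ell \circ f_N^a) \le P(\ell \circ f^*) + 2\lambda\|f^*\|_{\mathcal{H}_\kappa}^2 + \frac{C^2 \Gamma(N, m)}{\lambda N} + \frac{2C_1^2 C^2 \epsilon^4}{\lambda} + \frac{2C_1^2 C^2 \epsilon^2}{\sigma} + C_1 C e^{-N},$$

where $\epsilon$ is given in (12) and $C_1$ is a constant independent from $m$ and $N$. By choosing $\lambda$
that minimizes the above bound, we have

$$P(\ell \circ f_N^a) \le P(\ell \circ f^*) + 4\|f^*\|_{\mathcal{H}_\kappa} \epsilon^2 C C_1 \sqrt{1 + \frac{\Gamma(N, m)}{2N C_1^2 \epsilon^4}} + \frac{2C^2 C_1^2}{\sigma}\epsilon^2 + C_1 C e^{-N}.$$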
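
The step of choosing $\lambda$ can be checked with a small numerical sketch. It assumes the $\lambda$-dependent part of the bound has the form $2\lambda A + B/\lambda$, with $A$ standing for $\|f^*\|_{\mathcal{H}_\kappa}^2$ and $B$ collecting every term that is divided by $\lambda$; the function names and the numeric values are hypothetical.

```python
import math

def best_lambda(A, B):
    """Closed-form minimizer of 2*lam*A + B/lam over lam > 0, with the minimum value."""
    lam = math.sqrt(B / (2.0 * A))                # stationary point of the objective
    return lam, 2.0 * math.sqrt(2.0 * A * B)      # value of the objective at that point

def bound_part(lam, A, B):
    """The lambda-dependent part of the bound (assumed form)."""
    return 2.0 * lam * A + B / lam

A, B = 1.5, 0.02                                  # illustrative values only
lam_star, g_star = best_lambda(A, B)
```

Under this assumption the minimum equals $2\sqrt{2AB}$, which is where the square-root term in the minimized bound comes from.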



Remark 19 In the case when the eigenvalues of the kernel matrix follow a $p$-power law
with $p > 1$, we have $\epsilon^2 = O(N^{-p/(p+1)})$ according to (Koltchinskii and Yuan, 2010), and
$\Gamma(N, m) = O(N/m^{p-1})$ according to Theorem 9. Applying these results to Theorem 18, the
generalization performance of $f_N^a$ becomes

$$P(\ell \circ f_N^a) \le P(\ell \circ f^*) + 2\lambda\|f^*\|_{\mathcal{H}_\kappa}^2 + \frac{C_2 C^2}{\lambda m^{p-1}} + C_1 C e^{-N} + \frac{2C_3 C^2 N^{-2p/(p+1)}}{\lambda} + \frac{2C_4 C^2 N^{-p/(p+1)}}{\sigma}, \qquad (13)$$

where $C_2$, $C_3$, and $C_4$ are constants independent from $N$ and $m$. By choosing $\lambda$ that
minimizes the bound in (13), we have

$$P(\ell \circ f_N^a) \le P(\ell \circ f^*) + 4\|f^*\|_{\mathcal{H}_\kappa} C \sqrt{\frac{C_3}{N^{2p/(p+1)}} + \frac{C_2}{2m^{p-1}}} + \frac{2C_4 C^2}{\sigma} N^{-p/(p+1)} + C_1 C e^{-N}$$
$$= P(\ell \circ f^*) + O\left(N^{-p/(p+1)} + m^{-(p-1)/2}\right).$$

As indicated by the above inequality, when the eigenvalues of the kernel matrix follow a
$p$-power law, by setting $m = N^{2p/(p^2-1)}$ we are able to achieve performance similar to that of the full
kernel classifier (i.e., $O(N^{-p/(p+1)})$). In other words, we can construct a kernel
classifier without sacrificing its generalization performance using no more than $N^{2p/(p^2-1)}$
support vectors, which could be significantly smaller than $N$ when $p > 1 + \sqrt{2}$. For
example, for a kernel that generates the Sobolev space $W^{\alpha,2}(\mathbb{T}^d)$ of smoothness $\alpha > d/2$, where
$\mathbb{T}^d$ is the $d$-dimensional torus, the eigenvalues follow a $p$-power law with $p = 2\alpha > d$, which is
larger than $1 + \sqrt{2}$ when $d \ge 3$.
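
The arithmetic behind the choice of $m$ and the threshold on $p$ can be verified with a short sketch. The function name `support_vector_exponent` is hypothetical; it returns the exponent $e$ for which $m = N^e$ makes $m^{-(p-1)/2}$ match $N^{-p/(p+1)}$.

```python
import math

def support_vector_exponent(p):
    """Exponent e such that m = N**e balances m**(-(p-1)/2) against N**(-p/(p+1))."""
    return 2.0 * p / (p * p - 1.0)

threshold = 1.0 + math.sqrt(2.0)     # value of p at which the exponent equals one

for p in (2.0, threshold, 3.0, 4.0):
    e = support_vector_exponent(p)
    # e * (p - 1) / 2 recovers the rate exponent p / (p + 1) of the full classifier
    print(p, e, e * (p - 1.0) / 2.0, p / (p + 1.0))
```

The exponent drops below one exactly when $p^2 - 2p - 1 > 0$, that is, when $p > 1 + \sqrt{2}$, so only then is $m$ smaller than $N$.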



5. Conclusion 

We develop new methods for analyzing the approximation bound for the Nystrom method.
We show that the approximation error can be improved to $O(N/m^{1-\rho})$ in the case when
there is a large eigengap in the spectrum of a kernel matrix, where $\rho \in (0, 1/2)$ is introduced
to characterize the eigengap. When the eigenvalues of a kernel matrix follow a $p$-power law,
the approximation error is further reduced to $O(N/m^{p-1})$ under an incoherence assumption.
We develop a kernel classification approach based on the Nystrom method and show that
when the eigenvalues of a kernel matrix follow a $p$-power law ($p > 1$), we can reduce the
number of support vectors to $N^{2p/(p^2-1)}$, which could be significantly less than $N$ if $p$ is
large, without seriously sacrificing its generalization performance.

Appendix

Proof of Theorem 4 

We argue that there exists a kernel matrix $K$ such that (i) all its diagonal entries equal
1, and (ii) the first $m + 1$ eigenvalues of $K$ are in the order of $O(N/m)$. To see the existence
of such a matrix, we sample $m + 1$ vectors $u_1, \ldots, u_{m+1}$, where $u_i \in \mathbb{R}^N$, from a Bernoulli
distribution, with $\Pr(u_{i,j} = +1) = \Pr(u_{i,j} = -1) = 1/2$. We then construct $K$ as

$$K = \frac{1}{m+1}\sum_{i=1}^{m+1} u_i u_i^{\top} = \frac{1}{m+1} U U^{\top}, \qquad (14)$$

where $U = (u_1, \ldots, u_{m+1})$.

First, since $u_{i,j} = \pm 1$, we have $\mathrm{diag}(u_i u_i^{\top}) = \mathbf{1}$, where $\mathbf{1}$ is a vector of all ones, and
therefore $K_{i,i} = 1$ for $i \in [N]$. Second, we show that with some probability $1 - \delta$, all non-zero
eigenvalues of $\frac{1}{N} U^{\top} U$ are bounded between 1/2 and 3/2, i.e.,

$$\frac{1}{2} \le \lambda_{\min}\left(\frac{1}{N} U^{\top} U\right) \le \lambda_{\max}\left(\frac{1}{N} U^{\top} U\right) \le \frac{3}{2}. \qquad (15)$$

To prove (15), we use the concentration inequality in Proposition 6. We define $\xi_i = z_i z_i^{\top}$, $i =
1, \ldots, N$, where $z_i \in \mathbb{R}^m$ is the $i$th row of the matrix $U$, and take $\|\cdot\|$ in the above proposition
to be the spectral norm of a matrix. Since every element in $z_i$ is sampled from a Bernoulli
distribution with equal probabilities of being $\pm 1$, we have $\mathrm{E}[z_i z_i^{\top}] = I_m$ and $\|z_i z_i^{\top}\| = m$.
Thus, with a probability $1 - \delta$, we have







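$$\left\| \frac{1}{N}\sum_{i=1}^{N} \xi_i - I_m \right\| \le \frac{4m \ln(2/\delta)}{\sqrt{N}}.$$

When $N \ge 64 m^2 [\ln 4]^2$, for any sampled $U$, with 50% chance, we have

$$\left\| \frac{1}{N} U^{\top} U - I_m \right\| \le \frac{1}{2},$$

which implies (15).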

With the bound in (15) and using the fact that the non-zero eigenvalues of $U U^{\top}$ equal the
eigenvalues of $U^{\top} U$, it is straightforward to see that the first $m + 1$ eigenvalues of $K$ are
in the order of $\Omega(N/m)$. Up to this point, we have proved the existence of such a kernel matrix.
Next, we prove the lower bound for the constructed kernel matrix.
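
The existence argument can be illustrated with a quick simulation. This sketch assumes the normalization $K = U U^{\top}/(m+1)$, which is what makes every diagonal entry equal to one; the sizes `N` and `m` and the random seed are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, m = 4000, 5                                   # N exceeds 64 * m**2 * (ln 4)**2
U = rng.choice([-1.0, 1.0], size=(N, m + 1))     # columns are the sign vectors u_i

# Diagonal of K = U U^T / (m + 1), computed without forming the N x N matrix.
diag_K = (U**2).sum(axis=1) / (m + 1)

# Eigenvalues of (1/N) U^T U; they concentrate around one.
gram_eigs = np.linalg.eigvalsh(U.T @ U / N)

# Non-zero eigenvalues of K, using the shared spectrum of U U^T and U^T U.
top_eigs = gram_eigs[::-1] * N / (m + 1)
```

For this draw the eigenvalues of $\frac{1}{N} U^{\top} U$ stay well inside $[1/2, 3/2]$, so the $m + 1$ non-zero eigenvalues of $K$ are all of order $N/m$.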

Let $V_{1:(m+1)} = (v_1, \ldots, v_{m+1})$ be the first $m + 1$ eigenvectors of $K$. We construct $g$ as
follows: Let $u = V_{1:(m+1)} a$ be a vector in the subspace $\mathrm{span}(v_1, \ldots, v_{m+1})$ that satisfies
the condition $K_b^{\top} u = 0$. The existence of such a vector is guaranteed because
$\mathrm{rank}(K_b^{\top} V_{1:(m+1)}) \le m$. We normalize $a$ such that $\|a\|_2 = 1$. Then we let
$g = \sum_{i=1}^{N} u_i \kappa(x_i, \cdot) = \sum_{i=1}^{m+1} w_i \sqrt{\lambda_i} \varphi_i(\cdot)$, where $w = V_{1:(m+1)}^{\top} u$. It is easy to verify that (i) $g \in \mathcal{H}_b$ since
$\|u\|_2 = \|V_{1:(m+1)} a\|_2 = 1$, and (ii) $g \perp \mathcal{H}_a$ since $u^{\top} K_b = 0$. Using $g$, we have

$$\mathcal{E}(\mathcal{H}_a) = \max_{g \in \mathcal{H}_b} \min_{f \in \mathcal{H}_a} \|f - g\|_{\mathcal{H}_\kappa}^2 \ge \|g\|_{\mathcal{H}_\kappa}^2 = \sum_{i=1}^{m+1} w_i^2 \lambda_i \ge \Omega\left(\frac{N}{m}\right),$$

where we use $\|w\|_2 = \|V_{1:(m+1)}^{\top} V_{1:(m+1)} a\|_2 = \|a\|_2 = 1$. We complete the proof by using
the fact $\mathcal{E}(\mathcal{H}_a) = \|K - K_b \hat{K}^{\dagger} K_b^{\top}\|_2$.



Proof of Corollary 8 

Define $\xi(\hat{x}_i)$ to be a rank one linear operator, i.e.,

$$\xi(\hat{x}_i)[f](\cdot) = \kappa(\hat{x}_i, \cdot) f(\hat{x}_i).$$

Apparently, $L_m = \frac{1}{m}\sum_{i=1}^{m} \xi(\hat{x}_i)$ and $\mathrm{E}[\xi(\hat{x}_i)] = L_N$. We complete the proof by using the
result from Proposition 6 and the fact

$$\|\xi(\hat{x}_k)\|_{HS} = \kappa(\hat{x}_k, \hat{x}_k) \le 1,$$

where the equality follows from equation (3).

Proof of Corollary 14

We choose a = 2-y/ln N/j in Theorem 13. Since m > 16/i 2 (^tO , then we have a < 

1/4 

. Additionally, by having [i\T\/y/m > 1, the conditions in Theorem 13 hold, and by 



m 
IP 

3 



setting 5 = N 6 in Theorem 12, the condition in Theorem 12 holds, which together implies 

Pr I sup ||zfc|| 2 > 2fi^/\f\J 

VfceT c / 

m 

2NJ 

1 

V S,T V S,T ~ I 



< iYexp(- 7 a 2 ) + Pr (\\vJ t Vs,t\\2 < 



< N~ 3 + Pr 

< 2iV~ 3 . 

From this we have, with a probability 1 — 2N~ 3 



1 

> - 
~ 2 



/ \T\ ( m\ ' \J /i 3 |T|m 1 / 2 

sup z fc 2 <2/i\ h2 — 

keT c V m / ™ 

= 4 /H 

V 771 



Proof of Proposition 17 

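Since

$$\ell(y_i f(x_i)) = \max_{\alpha_i \in \Omega} \alpha_i y_i f(x_i) - \ell_*(\alpha_i),$$

we rewrite the optimization problem in (11) into a convex-concave optimization problem

$$\min_{f \in \mathcal{H}_a} \max_{\{\alpha_i \in \Omega\}_{i=1}^{N}} \frac{\lambda}{2}\|f\|_{\mathcal{H}_\kappa}^2 + \frac{1}{N}\sum_{i=1}^{N} \left(\alpha_i y_i f(x_i) - \ell_*(\alpha_i)\right).$$

Since $f \in \mathcal{H}_a$, we write $f = \sum_{i=1}^{m} z_i \kappa(\hat{x}_i, \cdot)$, resulting in the following optimization problem

$$\min_{z \in \mathbb{R}^m} \max_{\{\alpha_i \in \Omega\}_{i=1}^{N}} \frac{\lambda}{2} z^{\top} \hat{K} z + \frac{1}{N} (\alpha \circ y)^{\top} K_b z - \frac{1}{N}\sum_{i=1}^{N} \ell_*(\alpha_i).$$

Since the above problem is convex in $z$ and concave in $\alpha$, we can switch minimization
with maximization. We complete the proof by taking the minimization over $z$.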

Proof of Theorem 18 

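To simplify our presentation, we introduce the notations

$$P_N(\ell \circ f) = \frac{1}{N}\sum_{i=1}^{N} \ell(y_i f(x_i)), \qquad \Lambda(f) = P(\ell \circ f) - P(\ell \circ f^*).$$

Using $P_N(\ell \circ f)$, we can write $\mathcal{L}_N(f) = P_N(\ell \circ f) + \frac{\lambda}{2}\|f\|_{\mathcal{H}_\kappa}^2$. We first prove that

$$\mathcal{L}_N(f_N^a) \le \mathcal{L}_N(f_N) + \frac{C^2}{2\lambda N} \mathcal{E}(\mathcal{H}_a),$$

where $\max_{z \in \Omega} |z|^2 \le C^2$. Note that

$$\mathcal{L}_N(f_N) = \max_{\alpha \in \Omega^N} -\frac{1}{N}\sum_{i=1}^{N} \ell_*(\alpha_i) - \frac{1}{2\lambda N^2} (\alpha \circ y)^{\top} K (\alpha \circ y).$$

Then

$$\mathcal{L}_N(f_N^a) = \max_{\alpha \in \Omega^N} -\frac{1}{N}\sum_{i=1}^{N} \ell_*(\alpha_i) - \frac{1}{2\lambda N^2} (\alpha \circ y)^{\top} K (\alpha \circ y) + \frac{1}{2\lambda N^2} (\alpha \circ y)^{\top} (K - K_b \hat{K}^{\dagger} K_b^{\top}) (\alpha \circ y)$$
$$\le \max_{\alpha \in \Omega^N} \left\{ -\frac{1}{N}\sum_{i=1}^{N} \ell_*(\alpha_i) - \frac{1}{2\lambda N^2} (\alpha \circ y)^{\top} K (\alpha \circ y) \right\} + \max_{\alpha \in \Omega^N} \frac{1}{2\lambda N^2} (\alpha \circ y)^{\top} (K - K_b \hat{K}^{\dagger} K_b^{\top}) (\alpha \circ y)$$
$$\le \mathcal{L}_N(f_N) + \frac{1}{2\lambda N^2} \|\alpha\|_2^2 \|K - K_b \hat{K}^{\dagger} K_b^{\top}\|_2 \le \mathcal{L}_N(f_N) + \frac{C^2}{2\lambda N} \mathcal{E}(\mathcal{H}_a).$$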



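Then we proceed with the proof as follows:

$$\frac{\lambda}{2}\|f_N^a\|_{\mathcal{H}_\kappa}^2 + P(\ell \circ f_N^a) \le P_N(\ell \circ f_N^a) + \frac{\lambda}{2}\|f_N^a\|_{\mathcal{H}_\kappa}^2 + (P - P_N)(\ell \circ f_N^a)$$
$$\le P_N(\ell \circ f_N) + \frac{\lambda}{2}\|f_N\|_{\mathcal{H}_\kappa}^2 + \frac{C^2}{2\lambda N}\mathcal{E}(\mathcal{H}_a) + (P - P_N)(\ell \circ f_N^a)$$
$$\le P_N(\ell \circ f^*) + \frac{\lambda}{2}\|f^*\|_{\mathcal{H}_\kappa}^2 + \frac{C^2}{2\lambda N}\mathcal{E}(\mathcal{H}_a) + (P - P_N)(\ell \circ f_N^a),$$

where the third inequality follows from the fact that $f_N$ is the minimizer of $P_N(\ell \circ f) + \frac{\lambda}{2}\|f\|_{\mathcal{H}_\kappa}^2$. Hence,

$$\Lambda(f_N^a) \le \frac{\lambda}{2}\|f^*\|_{\mathcal{H}_\kappa}^2 - \frac{\lambda}{2}\|f_N^a\|_{\mathcal{H}_\kappa}^2 + \frac{C^2}{2\lambda N}\mathcal{E}(\mathcal{H}_a) + (P - P_N)(\ell \circ f_N^a - \ell \circ f^*).$$

Let $r = \|f^* - f_N^a\|_{L_2}$ and $R = \|f^* - f_N^a\|_{\mathcal{H}_\kappa}$. Define

$$\mathcal{G}(r, R) = \{f \in \mathcal{H}_\kappa : \|f - f^*\|_{L_2} \le r, \ \|f^* - f\|_{\mathcal{H}_\kappa} \le R\}.$$

Using the domain $\mathcal{G}$, we rewrite the bound for $\Lambda(f_N^a)$ by

$$\Lambda(f_N^a) \le \frac{\lambda}{2}\|f^*\|_{\mathcal{H}_\kappa}^2 - \frac{\lambda}{2}\|f_N^a\|_{\mathcal{H}_\kappa}^2 + \frac{C^2}{2\lambda N}\mathcal{E}(\mathcal{H}_a) + \sup_{f \in \mathcal{G}(r, R)} (P - P_N)(\ell \circ f - \ell \circ f^*).$$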

Since $\epsilon r \le e^N$ and $\epsilon^2 R \le e^N$, using Lemma 9 from (Koltchinskii and Yuan, 2010), we have,
with a probability $1 - 2N^{-3}$,

$$\sup_{f \in \mathcal{G}(r, R)} (P - P_N)(\ell \circ f - \ell \circ f^*) \le C_1 C (r\epsilon + R\epsilon^2 + e^{-N}),$$

where $C_1$ is a constant independent from $N$. Thus, with a probability at least $1 - 4N^{-3}$,
we have

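$$\Lambda(f_N^a) - C_1 C e^{-N} \le \frac{\lambda}{2}\|f^*\|_{\mathcal{H}_\kappa}^2 - \frac{\lambda}{2}\|f_N^a\|_{\mathcal{H}_\kappa}^2 + \frac{C^2 \Gamma(N, m)}{2\lambda N} + C_1 C \epsilon \|f_N^a - f^*\|_{L_2} + C_1 C \epsilon^2 \|f^* - f_N^a\|_{\mathcal{H}_\kappa}$$
$$\le \frac{\lambda}{2}\|f^*\|_{\mathcal{H}_\kappa}^2 - \frac{\lambda}{2}\|f_N^a\|_{\mathcal{H}_\kappa}^2 + \frac{C^2 \Gamma(N, m)}{2\lambda N} + \frac{C_1^2 C^2 \epsilon^2}{\sigma} + \frac{\sigma}{4}\|f_N^a - f^*\|_{L_2}^2 + \frac{C_1^2 C^2 \epsilon^4}{\lambda} + \frac{\lambda}{4}\|f^* - f_N^a\|_{\mathcal{H}_\kappa}^2$$
$$\le \frac{\lambda}{2}\|f^*\|_{\mathcal{H}_\kappa}^2 - \frac{\lambda}{2}\|f_N^a\|_{\mathcal{H}_\kappa}^2 + \frac{C^2 \Gamma(N, m)}{2\lambda N} + \frac{\lambda}{2}\|f^*\|_{\mathcal{H}_\kappa}^2 + \frac{\lambda}{2}\|f_N^a\|_{\mathcal{H}_\kappa}^2 + \frac{C_1^2 C^2 \epsilon^2}{\sigma} + \frac{\sigma}{4}\|f_N^a - f^*\|_{L_2}^2 + \frac{C_1^2 C^2 \epsilon^4}{\lambda}$$
$$= \lambda\|f^*\|_{\mathcal{H}_\kappa}^2 + \frac{C^2 \Gamma(N, m)}{2\lambda N} + \frac{C_1^2 C^2 \epsilon^2}{\sigma} + \frac{C_1^2 C^2 \epsilon^4}{\lambda} + \frac{\sigma}{4}\|f_N^a - f^*\|_{L_2}^2$$
$$\le \lambda\|f^*\|_{\mathcal{H}_\kappa}^2 + \frac{C^2 \Gamma(N, m)}{2\lambda N} + \frac{C_1^2 C^2 \epsilon^2}{\sigma} + \frac{C_1^2 C^2 \epsilon^4}{\lambda} + \frac{1}{2}\Lambda(f_N^a),$$

where in the second inequality we apply Young's inequality $ab \le \frac{a^2}{2} + \frac{b^2}{2}$ twice, the third
uses $\|f^* - f_N^a\|_{\mathcal{H}_\kappa}^2 \le 2\|f^*\|_{\mathcal{H}_\kappa}^2 + 2\|f_N^a\|_{\mathcal{H}_\kappa}^2$, and the last
inequality follows from the strong convexity of $\ell(z)$ and the fact that $f^*$ is the minimizer of $P(\ell \circ f) =
\mathrm{E}_{(x,y)}[\ell(y f(x))]$. Thus, with a probability at least $1 - 4N^{-3}$, we have

$$P(\ell \circ f_N^a) \le P(\ell \circ f^*) + 2\lambda\|f^*\|_{\mathcal{H}_\kappa}^2 + \frac{C^2 \Gamma(N, m)}{\lambda N} + \frac{2C_1^2 C^2 \epsilon^2}{\sigma} + \frac{2C_1^2 C^2 \epsilon^4}{\lambda} + C_1 C e^{-N}.$$

We complete the proof by minimizing over $\lambda$ in the R.H.S. of the above inequality.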